Abstract



It is proposed that the co-expression of statistically significant mo- 
tifs among the sequences of a proteome is a phylogenetic trait. From 
the co-expression matrix of such motifs in a group of prokaryotic pro- 
teomes a suitable definition of a phylogenetic distance is introduced and 
the corresponding distance matrix between proteomes is constructed. 
From the distance matrix a phylogenetic tree is inferred, following a 
standard procedure. The inferred tree is compared with a reference
tree deduced from a distance matrix obtained from the alignment
of ribosomal RNA sequences. Our results are consistent with the hy- 
pothesis that biological evolution manifests itself with a modulation of 
basic correlations between shared peptides of short length, present in 
protein sequences. Moreover, the simple procedure we propose confirms 
that it is possible, sampling entire proteomes, to average the effects of 
lateral gene transfer and infer reasonable phylogenies. 

Key words: Genomics, whole-proteome phylogeny, k-motifs,
co-expression matrix.



Each living species is the result of its evolution; this historical as- 
sumption is the basic tenet of modern evolutionary biology. Molecular 
systematics [1] aims at classifying living species by measuring
differences in their inherited molecular constituents, not in their phenotypic,
macroscopic appearance. Since the classic paper by Zuckerkandl and 
Pauling [2], rational molecular systematics rests on the analysis of ac- 
tual sequences, which are, though in an indirect way, the archive of 
the evolutionary information about biological species (taxa). Differ- 
ences in nucleotide or amino acid sequences are the objective material 
elements to start from; that is particularly true in the classification of 
microscopic, unicellular organisms, where more macroscopic methods, 
based on ecological or pathological properties of these species, seem
to be less fundamental. More than twenty years ago, Carl Woese
founded the universal molecular classification of living organisms
on ubiquitous co-evolved sequences, such as the RNA of the
small ribosomal subunit (SSU rRNA) [4]. The main achievement was
the discovery of the fundamental tripartition of the tree of life into the
branches of Bacteria, Archaea and Eukarya [3]. Nowadays molecular
phylogeny is a well established discipline, based on probabilistic meth- 
ods [Hj; nevertheless, the existence of lateral gene transfer [Zj|H] (i. e. 
mixing up of genes between species, particularly practiced among
prokaryotes) has created some problems in the field and led some
radical phylogenists to argue against the reliability of single-gene
phylogenies j2j. To cope with this problem, ideas and methods have
been developed for more than ten years to infer molecular
phylogenies not from the analysis of groups of single genes, as classically
done, but rather from the analysis of whole genomes and proteomes
[10]. This is also the subject of the present work.

Proteomes are far from being a random assembly of peptides.
Clustering of amino acids [11], and strong correlations among genomic [12]
and proteomic [13] segments have been clearly demonstrated. These
results give meaning to the metaphor of protein sequences viewed as
texts written in a still unknown language [14].

Following this view we assume that biological evolution could
manifest itself, at the molecular level, through the modulation of significant
sequence elements that can be variously combined in the evolutionary
inflection of the language embodied by the proteins of living
organisms. It is reasonable to consider as significant those tracts of a
protein sequence which exhibit a pronounced deviation from a random
assembly of amino acids. We have then looked for putatively significant
elements as short peptide sequences of length $k$ which occur, in a
proteome, a number of times larger than expected in a random proteome.
In this work we show that these peptides are acted upon by natural
selection and display, in different proteomes, statistical correlations able
to express evolutionary distances.

A proteome $P$ is a collection of $n_P$ protein sequences, i.e. strings of
various lengths made of symbols from an alphabet $A$ of 20 letters:
$A = \{\sigma_1, \sigma_2, \ldots, \sigma_{20}\}$; each $\sigma_i$ labels one of the different amino acids a protein
is made of. If the proteome contains $N_P$ amino acids (i.e. letters), then
one can compute their relative frequencies: $f(\sigma_i) = n_i/N_P$, $n_i$ being
the number of times the $i$-th amino acid occurs in the proteome and
$i = 1, 2, \ldots, 20$. We define as $k$-peptides sequences of $k$ contiguous letters.
The number of all possible $k$-peptides is $20^k$. From the proteome $P$ we
can only select $N_P - n_P \cdot (k-1)$ overlapping $k$-peptides; some of them
occur once, others more than once, doing so in the same or in
different proteins.
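The counting step described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the two toy sequences are assumptions introduced here, not part of the original analysis, and every sequence is assumed to be at least $k - 1$ letters long so that the window count matches $N_P - n_P \cdot (k-1)$.

```python
from collections import Counter

def residue_frequencies(proteome):
    """Relative frequency f(sigma_i) = n_i / N_P of each letter in the proteome."""
    counts = Counter("".join(proteome))
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}

def kpeptide_counts(proteome, k):
    """Observed occurrences of every overlapping k-peptide, protein by protein."""
    counts = Counter()
    for seq in proteome:
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts

# Hypothetical toy proteome made of two short sequences (illustration only).
toy_proteome = ["MKTAYIAKQR", "MKTAYLLG"]
k = 3
freqs = residue_frequencies(toy_proteome)
counts = kpeptide_counts(toy_proteome, k)
n_letters = sum(len(seq) for seq in toy_proteome)        # N_P
n_windows = n_letters - len(toy_proteome) * (k - 1)      # N_P - n_P (k - 1)
```

Windows are taken within each protein and never across the boundary between two proteins, which is why $n_P \cdot (k-1)$ positions are lost.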

Denoting with $p_j^{(k)} = \{\sigma_{1j}, \sigma_{2j}, \ldots, \sigma_{kj}\}$ the $j$-th $k$-peptide, we can
count the number $N_j^{(o)}$ of times it occurs in the actual proteome. We
can also estimate the expected number of occurrences $N_j^{(e)}$ of the $j$-th
$k$-peptide in a random proteome (of the same length and with the same
number of $k$-peptides), generated by independent random extractions
of letters with the constraint of producing, on the average, the prescribed
relative frequencies $f(\sigma_i)$. That is:

$$N_j^{(e)} = [N_P - n_P \cdot (k-1)] \cdot \Pr[p_j^{(k)}], \qquad (1)$$

where $\Pr[p_j^{(k)}]$, the probability of occurrence of the $j$-th $k$-peptide, can
be estimated as $f(\sigma_{1j}) \cdot f(\sigma_{2j}) \cdots f(\sigma_{kj})$, i.e. as the product of the
relative frequencies of its component letters in the actual proteome.
For each $k$-peptide of expected occurrence $N_j^{(e)}$, the probability that
it is observed $N$ times in a random proteome (with the same amino acid
composition and sequences of the same lengths as in the actual
proteome) is given by a Poissonian distribution:

$$\Pr\nolimits_{N_j^{(e)}}[N] = \frac{[N_j^{(e)}]^{N}}{N!} \cdot \exp[-N_j^{(e)}]. \qquad (2)$$

We define as statistically relevant the over-expressed $k$-peptides whose
observed number of occurrences $N_j^{(o)}$ is such that:

$$\int_0^{N_j^{(o)}} \Pr\nolimits_{N_j^{(e)}}[N'] \, dN' > 0.95 \qquad (3)$$

(i.e. the observed occurrence of the $k$-peptide falls in the upper five
percent tail of its Poissonian distribution). Let us hereafter call
$k$-motifs the over-expressed $k$-peptides selected following inequality (3).
Analogously, we have defined a test set of $k$-peptides which, differently
from $k$-motifs, are expressed as expected (i.e. $N_j^{(o)} \approx N_j^{(e)}$, the observed
number of occurrences being close to the expected one). We call
these peptides expected $k$-peptides. $k$-motifs and expected $k$-peptides
are differently distributed along protein sequences: $k$-motifs are seldom
alone in a protein and in many cases they partly overlap, forming longer,
potentially significant, tracts. They occur at specific distances one
from the other, whereas expected $k$-peptides are isolated and dispersed,
without any recurrent clustering. As an example, in Fig. 1 we show the
occurrence of $k$-motifs in an archaeal protein.
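The selection rule of Eqs. (1)-(3) might be sketched as follows. The function names are hypothetical, and one point is an assumption rather than something stated in the text: the integral in inequality (3) is read here as the discrete sum over counts strictly below the observed one, so that a peptide is accepted when the upper-tail probability $P[N \geq N^{(o)}]$ is below 0.05.

```python
import math

def expected_count(peptide, freqs, n_windows):
    """Eq. (1): number of windows times the product of the letter frequencies."""
    prob = 1.0
    for letter in peptide:
        prob *= freqs.get(letter, 0.0)
    return n_windows * prob

def poisson_prob_below(n_obs, mean):
    """P[N < n_obs] for a Poisson variable with the given mean (Eq. 2 summed)."""
    term = math.exp(-mean)  # P[N = 0]
    total = 0.0
    for n in range(n_obs):
        total += term
        term *= mean / (n + 1)  # P[N = n + 1] from P[N = n]
    return total

def is_k_motif(n_obs, mean, threshold=0.95):
    """Assumed discrete reading of inequality (3): upper-tail test at 5%."""
    return poisson_prob_below(n_obs, mean) > threshold
```

With an expected count of 2, for example, five occurrences are not enough to pass the test while six are. For very large expected counts `math.exp(-mean)` underflows, so a production version would work with logarithms instead.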

The non-trivial statistical properties of the $k$-motifs suggest that,
among the $k$-peptides present in a proteome, they could display
patterns of correlated expression useful to derive phylogenetic distances
between taxa.

We have considered eighteen proteomes from GenBank [15]:
ten from Archaea (Aeropyrum pernix, Archaeoglobus fulgidus,
Halobacterium sp. NRC-1, Methanococcus jannaschii, Methanobacterium
thermoautotrophicum, Pyrobaculum aerophilum, Pyrococcus abyssi,
Pyrococcus furiosus, Sulfolobus solfataricus, Thermoplasma acidophilum) and
eight from Bacteria (Agrobacterium tumefaciens, Bacillus subtilis,
Chlorobium tepidum, Deinococcus radiodurans, Escherichia coli K12,
Synechocystis sp. PCC 6803, Thermotoga maritima, Yersinia pestis CO92).
We have selected $k$-motifs from all these proteomes and collected
them into $k$-dictionaries. Let $Z_n(k)$ be the subset of the $k$-dictionary
composed of those $k$-motifs which are expressed in at least $n$ different
proteomes. $Z_1(k)$ is thus the entire set of $k$-motifs, referring to the
considered group of proteomes. $Z_2(k)$ is the set of $k$-motifs common to
at least two proteomes; $Z_1(k) - Z_2(k)$ is, therefore, the subset of the
$k$-motifs specific to one proteome. Fig. 2 reports, as a function of $k$, the
number of entries in different $k$-dictionaries, normalized over the total
number of expressed $k$-peptides ($Z_1(k)$). It is worth noting that, as $k$
increases, the proteome-specific $k$-motifs ($Z_1(k) - Z_2(k)$) rapidly
overwhelm the shared $k$-motifs (i.e. $Z_2(k)$). The $Z_2(k)$ dictionary (open
circles in Fig. 2) has a significant number of entries for low and
intermediate $k$ values, as it contains almost 10% of the expressed peptides for
$k = 6$.
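The nested dictionaries $Z_n(k)$ can be illustrated with a small sketch. The helper name and the toy motif sets below are hypothetical and serve only to show the set operations involved.

```python
from collections import Counter

def shared_dictionary(motifs_by_proteome, n):
    """Z_n(k): the motifs expressed in at least n different proteomes."""
    presence = Counter()
    for motifs in motifs_by_proteome.values():
        presence.update(set(motifs))  # each proteome counts once per motif
    return {motif for motif, count in presence.items() if count >= n}

# Hypothetical 3-motifs found in three proteomes (illustration only).
motifs_by_proteome = {
    "taxon_A": {"GKT", "LSG", "DEP"},
    "taxon_B": {"GKT", "LSG", "QRQ"},
    "taxon_C": {"GKT", "VTH"},
}
z1 = shared_dictionary(motifs_by_proteome, 1)  # every motif
z2 = shared_dictionary(motifs_by_proteome, 2)  # shared by two or more proteomes
z3 = shared_dictionary(motifs_by_proteome, 3)
proteome_specific = z1 - z2                    # Z_1(k) - Z_2(k)
```

By construction the dictionaries are nested, $Z_3(k) \subseteq Z_2(k) \subseteq Z_1(k)$, which is why their sizes can only decrease as $n$ grows.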

We now define the co-expression matrix of the set $Z_n(k)$ in a
proteome $P$ as the matrix $A^{(P)}[Z_n(k)]$ whose element $ij$ counts the number of
times the $i$-th and $j$-th $k$-motifs, from $Z_n(k)$, occur together in one of the
proteins of $P$. This matrix resembles the adjacency matrix of the
network formed by linking words when they occur in the same phrase in
texts written in natural languages, as done in recent linguistic studies.
In a subsequent, more extended paper we shall present the statistical
properties of the linguistic co-expression networks built on sets of
$k$-motifs [17].
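A minimal sketch of how such a co-expression matrix could be assembled is given below. It rests on an assumption about the counting convention: each protein contributes one unit to the element $ij$ when it contains both motifs, regardless of how many times each motif occurs in it, and the diagonal element $ii$ counts the proteins containing motif $i$. The function name and toy data are hypothetical.

```python
def coexpression_matrix(proteins, motifs):
    """Symmetric matrix whose element [i][j] counts the proteins containing
    both motif i and motif j (assumed convention, one count per protein)."""
    ordered = sorted(motifs)  # arbitrary but fixed ordering of the motifs
    size = len(ordered)
    matrix = [[0] * size for _ in range(size)]
    for seq in proteins:
        present = [i for i, motif in enumerate(ordered) if motif in seq]
        for a in present:
            for b in present:
                matrix[a][b] += 1
    return ordered, matrix

# Hypothetical toy proteome and motif dictionary (illustration only).
toy_proteins = ["AAGKTLSG", "GKTDEP", "LSGQQ"]
ordered, matrix = coexpression_matrix(toy_proteins, {"GKT", "LSG", "DEP"})
```

The direct substring scan is adequate for a toy case only; for a dictionary with thousands of entries one would rather slide a window of length $k$ along each protein and look the window up in a hash set.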

The pattern of co-expression matrices based on $Z_n(k)$, for a given
value of $k$, is far from trivial in all the considered proteomes, with many
groups of $k$-motifs co-expressed in one or more proteins up to several
tens of times. On the other hand, we have noticed that co-expression
matrices generated by equally populated sets of expected $k$-peptides
are sparse, with just a few small elements different from zero.

The different co-expression patterns of $k$-motifs in different
proteomes are the basis of the method we propose in this letter.

The observations reported above might be summarized as follows: (1)
there is a substantial set of $k$-motifs which are common among the
considered organisms of a given kingdom; this set might constitute
a sort of basic dictionary collecting robust pieces of information, stable
across the taxa; (2) there is a larger set of proteome-specific $k$-motifs
[$Z_1(k) - Z_2(k)$], whose evolution occurred within a specific taxon and
might be considered as the manifestation of a linguistic specificity of
that species; (3) there is a substantial set of $k$-motifs, $Z_2(k)$, containing
the common $k$-motifs together with a number of $k$-motifs which are
quite specific but, nevertheless, common to a few species.

It is reasonable to assume that common and proteome-specific
$k$-motifs somehow interact: the usage of proteome-specific terms might
influence the usage of the common $k$-motifs, in the sense that the
co-expression of the latter might be modulated by usage of the former,
giving origin to a specific co-expression pattern.

Let us now propose a definition of the phylogenetic distance among
proteomes. From $A^{(P)}[Z_n(k)]$, the symmetric co-expression matrix of a
given proteome $P$, we can extract a co-expression vector $V^{(P)}[Z_n(k)]$,
whose components are the $n_k(n_k+1)/2$ distinct entries of the matrix
($n_k$ is the number of $k$-motifs in $Z_n(k)$), ordered in an arbitrary but
fixed way, e.g. by rows:

$$V_s^{(P)}[Z_n(k)] = A_{ij}^{(P)}[Z_n(k)] \qquad (4)$$

with $j \geq i$ and $s$ ranging from one to $n_k(n_k+1)/2$. We consider the
co-expression vector as a linguistic fingerprint of a proteome, expressing its
peculiar use of both common and proteome-specific motifs. We define
a phylogenetic distance $d_{P'P''}(j,k)$ between two proteomes $P'$ and $P''$
through the scalar product [18] of their co-expression vectors based on
a $Z_j(k)$ dictionary:

$$d_{P'P''}(j,k) = 1 - \frac{\sum_s V_s^{(P')}[Z_j(k)] \cdot V_s^{(P'')}[Z_j(k)]}{|V^{(P')}| \cdot |V^{(P'')}|} \qquad (5)$$
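Equations (4) and (5) amount to flattening the upper triangle of each symmetric matrix and taking one minus the cosine of the angle between the two resulting vectors. A short sketch follows, with hypothetical function names; it assumes that both matrices are built on the same dictionary with the same motif ordering, and it raises an error for an all-zero vector, a case the text does not discuss.

```python
import math

def coexpression_vector(matrix):
    """Eq. (4): upper triangle (j >= i) of a symmetric matrix, read by rows."""
    size = len(matrix)
    return [matrix[i][j] for i in range(size) for j in range(i, size)]

def phylogenetic_distance(vec_a, vec_b):
    """Eq. (5): one minus the normalized scalar product of two vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("distance is undefined for an all-zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical 3x3 co-expression matrix (illustration only).
vector = coexpression_vector([[1, 1, 0], [1, 2, 1], [0, 1, 2]])
```

Since the entries of a co-expression vector are non-negative counts, the distance defined in this way lies between 0 (proportional fingerprints) and 1 (no co-expressed pair in common).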

In this work we have evaluated phylogenetic distances among a set
of prokaryotic proteomes, using the $Z_2(6)$ dictionary. There are several
arguments to motivate this choice for the probe set of motifs. The $Z_2(6)$
dictionary of the set of prokaryotes we are considering has 7712 entries;
it contains a balanced mixture of common and proteome-specific tracts.
The use of a $Z_j(6)$ dictionary with $j > 2$ would have produced a
distance evaluation based only on strongly conserved motifs, i.e.
those common to a large number of organisms, disregarding the
modulation effect that they could produce on the proteome-specific tracts.
These dictionaries, moreover, are quite small (the Zq(6) dictionary, for 
instance, contains only 55 entries) and their size markedly decreases 
with the increase of both $j$ and $k$. On the other hand, $Z_2(k)$ dictionaries
with $k < 6$ are made of a large number of motifs (e.g. $Z_2(5)$ has
161903 entries), but the potential increase in sensitivity, putting aside
practical considerations due to the treatment of large matrices, would
be spoiled by some volatility of low-$k$ motifs. Due to the rigidity of the
statistical criterion (3) (same acceptance threshold for all the motifs),
a $k$-motif which has passed the test and belongs to the $k$-dictionary
could pass the test for the $k+1$ dictionary as part of one of the 40
$(k+1)$-peptides which can be obtained by adding one letter at its
beginning or at its end. We have observed that this is rarely the case for
low $k$. So, if $k$ is too low then many short peptides, which are accepted
as statistically significant and could be the nucleus of biologically
relevant tracts of sequence, are lost and not recognized in the $k+1$ test.
When $k$ is larger than 5, the motifs have been seen to be more stable,
in that they generally appear as part of longer motifs also in the $k+1$
dictionary. Indeed, some of them are also "lost", but this can be the
sign of the "end" of the specific tract. The $Z_2(6)$ dictionary is thus a
good trade-off between number of entries and balance of common and
proteome-specific tracts. Moreover, $k = 6$ seems to be a peculiar length
for peptides: it has been proven that 6-peptides allow a unique
reconstruction of a protein sequence from the collection of its constituent
$k$-peptides [19].
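The stability check described above (does a $k$-motif survive inside at least one of its 40 one-letter extensions in the $k+1$ dictionary?) can be sketched as follows. The function names and the example dictionary are hypothetical; the amino acid alphabet is the standard one-letter code.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def one_letter_extensions(motif):
    """The 40 (k+1)-peptides obtained by adding one letter at either end."""
    left = [letter + motif for letter in AMINO_ACIDS]
    right = [motif + letter for letter in AMINO_ACIDS]
    return left + right

def survives(motif, next_dictionary):
    """True if the motif is part of at least one entry of the k+1 dictionary."""
    return any(ext in next_dictionary for ext in one_letter_extensions(motif))

# Hypothetical (k+1)-dictionary containing a single 6-motif (illustration only).
next_dictionary = {"CGKTTT"}
```

For a motif made of a single repeated letter, the left and right extension by that same letter coincide, so the list of 40 extensions then contains 39 distinct peptides.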

By using the definition of eq. (5) we have evaluated all the distances
between the considered set of proteomes. The resulting distance
matrix has been processed by the neighbor-joining method [20], using the
PHYLIP package [21]. The dendrogram we have obtained through the
procedure outlined above is shown in Fig. 3. In Fig. 4 we show the tree
obtained, for the same set of taxa, from the server of the Ribosomal
Database Project [22]. This last phylogeny can be assumed as a
reference, because it is based on the alignment of sequences of RNA from the
small ribosomal subunit. This molecule is ubiquitous and coevolved to
accommodate a well defined set of ribosomal proteins, hardly subject to
lateral gene transfer. The tree from the alignment of the SSU rRNA shows
the clear separation of the two kingdoms, Archaea and Bacteria. This
separation appears less clearly resolved in our tree, whose center seems
to be an archaeal spot from which emerge Sulfolobus solfataricus, a
branch of 5 bacteria, a group of Archaea with the bacterium
Deinococcus radiodurans (D. radi) among them, and a group of Archaea with
the bacteria Chlorobium tepidum (C.tepi) and Synechocystis (Synech)
segregated among them. One could be discouraged by this result and
think that the method we are proposing is unable to resolve the basic
tripartition of the tree of life and that we are mistakenly classifying
taxa. One could argue, from a different perspective, that the kind
of method we are proposing, based on global statistical properties of
the proteomes, is able to reveal phylogenetic associations which are at
variance with the fundamental SSU rRNA classification. The stability
of the method and its biological foundations have to be further
investigated. However, it is worth noting that, quite surprisingly, the tree
we have reconstructed through a biologically blind criterion borrowed
from statistical linguistics can be reasonably compared with those
obtained through refined and deep whole-genome analyses [23, 24]. In
particular we believe that whole-genome phylogenies of the kind we
are proposing should be compared with very recent observations
suggesting that eukaryotes could originate from the fusion of pre-existing
prokaryotic genomes [25]. Moreover, the important distinction between
operational and informational genes [26] suggests that we are looking
at a possibly different statistics of occurrence of the $k$-motifs, which
are the probe of our method, over the two kinds of proteins; we also
believe that blind approaches based on the statistics of short sequence
motifs, such as the one we present here, could be less affected by the different
sources of bias which are nevertheless present in statistical phylogenomic
studies based on the clustering of entire genes [27].
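As a practical note on the tree-building step mentioned above, a distance matrix has to be written to a text file before it can be handed to PHYLIP. The sketch below writes the square distance-matrix layout as we understand it from the PHYLIP documentation: a first line with the number of taxa, then one line per taxon with a name padded to ten characters followed by its distances. The function name, the labels and the numerical values are hypothetical and are not results of this work.

```python
def phylip_distance_matrix(names, distances):
    """Format a square distance matrix in the PHYLIP text layout (assumed:
    taxon count on the first line, names truncated or padded to 10 characters)."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, distances):
        label = name[:10].ljust(10)
        lines.append(label + " " + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"

# Hypothetical labels and distances (illustration only).
names = ["A_pernix", "E_coli", "B_subtilis"]
distances = [
    [0.0, 0.3, 0.5],
    [0.3, 0.0, 0.4],
    [0.5, 0.4, 0.0],
]
text = phylip_distance_matrix(names, distances)
```

The resulting text can be saved to a file and supplied as input to the neighbor-joining program of the package; the exact program options should be checked against the documentation of the installed PHYLIP version.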

In the last stage of the preparation of this manuscript we became 
aware of an important study which uses an approach very close to ours 
and which has been made available on line j29j . In that method 
a proteome is also sampled for statistically significant 6-peptides; the
background constituted by an uncorrelated random extraction of
letters is subtracted. The fingerprint vector of each proteome has $20^6$
components, each of which expresses the statistical deviation of
the occurrence of a peptide from that expected in a random
proteome. In our approach the fingerprint is represented instead by the
co-expression vector. Following the method proposed in that study, we have
derived a distance matrix of the same 18 species investigated here;
the phylogenetic tree we have obtained has a more resolved dichotomy
between Archaea and Bacteria, and a topology which, though more
consistent, is still less resolved than, and definitely not coincident with, that
of a tree obtained from the distance matrix based on the alignment of
the SSU rRNA [22]. It will be interesting to proceed, in the near
future, to a careful assessment of the biological information which can be
derived from the two approaches. At present we tend to have the
following view: the phylogenetic picture based on the tree of life has been
put under scrutiny by the large extent of lateral gene transfer between
taxa; that challenged phylogenists, using properly selected groups of
genes, to reveal evolutionary relations which are not consistent with
the universal tree of life. Recently there have been claims for the tree
of life to fuse into what has been called the ring of life.

Phylogenies based on whole genomes are coherent with the view of
the three kingdoms Archaea, Bacteria and Eukarya as originating from
a world based on gene exchange and fusion of genomes. In particular,
testing entire proteomes against patterns of correlated expression of
statistically significant sequence motifs seems to be a proper way to
cope with the original genome fusion regime and with the mean field
generated by lateral gene transfers, gene duplication and loss. The
method proposed in [28] samples in a more generic way the
evolutionary correlations between $k$-motifs and seems to force the trees toward
the tree of life shape. Our method, based on patterns of co-expression
of $k$-peptides, could be more in agreement with the view of a
fusion-based ring of life. Of course, a quantitative comparison between different
methods is now really required; we are planning an extensive
quantitative investigation of the relative merits of different approaches in
reconstructing the phylogeny(ies) of a properly selected set of taxa. In
doing that a clear mathematical setting is of paramount importance

139 EH- 

The scientific content of phylogenies that are based on the statis- 
tical sampling of entire proteomes and that avoid sequence alignment 
algorithms has still to be validated. Nevertheless we believe that they 
can have a practical relevance at least as tools for the rapid molec- 
ular classification of the ever increasing number of freshly sequenced 
genomes. 
CGKTTT DEPLSN DRIAVM DQVEAM EPLSNL EAMTMG GCGKTT GGQRQR
GKTTTL GPSGCG GQRQRV GDRIAV HDQVEA IYVTHD IAFPLK IAGLEE
KTTTLR LGPSGC LLGPSG LLMDEP LMDEPL LSGGQR LSNLDA LDAKLR
LRMIAG MDEPLS MIAGLE MVFQSY NIAFPL NLDAKL PLSNLD PSGCGK
PHMTVY QLSGGQ QRQRVA QVEAMT RIAVMN RMIAGL SGCGKT SGGQRQ
SNLDAK TIYVTH TLRMIA TTIYVT TTLRMI TTTLRM THDQVE VLLMDE
VEAMTM VFQSYA VLLGPS VTHDQV YVTHDQ AQLSGG PAQLSG

>gi|14520241|ref|NP_125715.1| hypothetical MALTOSE/MALTODEXTRIN TRANSPORT ATP-BINDING [Pyrococcus abyssi]
MVEVRLENLTKKFGNFTAVNKLNLTIKDGEFLVLLGPSGCGKTTTLRMIAGLEEPTE
GKIYFGDREVTYLPPRERNISMVFQSYAVWPHMTVYDNIAFPLKIKKFPRDEIDKRV
RWAAELLQIEELLDRYPAQLSGGQRQRVAVARAIWEPDVLLMDEPLSNLDAKLRVA
MRAEIKKLQQKLKVTTIYVTHDQVEAMTMGDRIAVMNRGQLLQVGPPTEVYLKPNSV
FVATFIGAPEMNIVEVSVGDGYLEGKGFKIELPQDIMELLRDYIGKTVLFGIRPEHM
TVEGVSELAHMKKTAKLNAKVDFVEALGTDTILHVKFGDELVKVKLPGHIPIEVGKE
VTIVIDLDMMHVFDKDTEKAII

Figure 1: A typical dispersion of 6-motifs (bold) in a protein of the
archaeon P. abyssi. In the upper part of the figure are reported the 55
6-motifs, belonging to $Z_2(6)$, which are expressed in the protein. Note
the clustering and overlap of the motifs.
Figure 2: Relative fraction of $k$-motifs present in the different subsets
of the $k$-dictionary: $Z_1(k) - Z_2(k)$ (black circles), $Z_2(k)$ (white circles),
$Z_6(k)$ (squares).




Figure 3: Unrooted phylogenetic tree of the considered proteomes: the 
full name of the species can be easily reconstructed from the abbrevia- 
tions. After a / the kingdom is indicated: A stands for Archaea, B for 
Bacteria. 
Figure 4: SSU rRNA phylogeny of the 18 species here considered; from